Abstract


We propose a simple parallel algorithm for finding a blocking flow in an acyclic network. On
an $n$-vertex, $m$-arc network, our algorithm runs in $O(n \log n)$ time and $O(nm)$ space using an
$m$-processor EREW PRAM. A consequence of our algorithm is an $O(n^2 (\log n) \log(nC))$-time,
$O(nm)$-space, $m$-processor algorithm for the minimum-cost circulation problem, on a network with
integer arc capacities of magnitude at most $C$.


1  Terminology 

In this paper we use the following definitions. Let $G = (V, E)$ be an acyclic directed graph with vertex
set $V$ of size $n$ and arc set $E$ of size $m$. For ease in stating time bounds, we assume that $m \ge n - 1$.
Define $E^{-1} = \{(w, v) \mid (v, w) \in E\}$ and $E^{*} = E \cup E^{-1}$. For any vertex $w$ we denote by
$E(w)$ the set of vertices adjacent out from $w$, $E(w) = \{x \mid (w, x) \in E\}$, and by $E^{-1}(w)$ the set
of vertices adjacent into $w$, $E^{-1}(w) = \{v \mid (v, w) \in E\}$. Graph $G$ is layered if each vertex $v$
can be assigned an integer layer $L(v)$ such that $L(w) = L(v) + 1$ for every arc $(v, w)$.

Graph $G$ is a network if it has two distinguished vertices, a source $s$ and a sink $t$, and a
nonnegative real-valued capacity $u(v, w)$ on every arc $(v, w)$. A preflow on a network is a nonnegative
real-valued function $f$ on the arcs such that $f(v, w) \le u(v, w)$ for every arc $(v, w)$ and
$\sum_{v \in E^{-1}(w)} f(v, w) \ge \sum_{x \in E(w)} f(w, x)$ for every vertex $w \ne s$. The quantity
$e(w) = \sum_{v \in E^{-1}(w)} f(v, w) - \sum_{x \in E(w)} f(w, x)$ is called the excess at vertex $w$.
A preflow $f$ is a flow if $e(w) = 0$ for every vertex $w \notin \{s, t\}$.

The residual capacity of an arc $(v, w)$ with respect to a preflow $f$ is $u_f(v, w) = u(v, w) - f(v, w)$.
Arc $(v, w)$ is saturated if $u_f(v, w) = 0$ and residual if $u_f(v, w) > 0$. A preflow is blocking if every
path in $G$ from $s$ to $t$ contains at least one saturated arc, i.e., there is no path of residual arcs
from $s$ to $t$.
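The definitions above can be checked mechanically on a small network. The following Python sketch is only an illustration: the function names (`excess`, `is_preflow`, `is_blocking`) and the dictionary-based representation of capacities and flows are assumptions made here, not notation from the paper.

```python
from collections import defaultdict

def excess(arcs, flow, w):
    """e(w): total flow into w minus total flow out of w."""
    inflow = sum(flow[(v, x)] for (v, x) in arcs if x == w)
    outflow = sum(flow[(v, x)] for (v, x) in arcs if v == w)
    return inflow - outflow

def is_preflow(arcs, cap, flow, s):
    """Capacity constraints hold and every vertex other than s has excess >= 0."""
    vertices = {v for arc in arcs for v in arc}
    within_capacity = all(0 <= flow[a] <= cap[a] for a in arcs)
    return within_capacity and all(
        excess(arcs, flow, w) >= 0 for w in vertices if w != s
    )

def is_blocking(arcs, cap, flow, s, t):
    """True if there is no path of residual arcs from s to t."""
    residual_out = defaultdict(list)
    for (v, w) in arcs:
        if cap[(v, w)] - flow[(v, w)] > 0:   # arc (v, w) is residual
            residual_out[v].append(w)
    seen, stack = {s}, [s]
    while stack:                              # depth-first search over residual arcs
        v = stack.pop()
        if v == t:
            return False
        for w in residual_out[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return True

# A two-arc example: saturating (s, a) leaves one unit of excess at a.
arcs = [("s", "a"), ("a", "t")]
cap = {("s", "a"): 2, ("a", "t"): 1}
flow = {("s", "a"): 2, ("a", "t"): 1}
```

In this example the preflow is blocking because the only arc leaving `s` is saturated, yet it is not a flow, since vertex `a` still carries excess.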

Our model of parallel computation is the exclusive-read, exclusive-write parallel random-access
machine (EREW PRAM) [7]. We shall also briefly consider distributed computation models [10].

2  Perspective 

The problem of finding a blocking flow in an acyclic network arises as a subproblem in computing
maximum flows and in computing minimum-cost circulations. Specifically, Dinic [5] showed that the
maximum flow problem can be solved by solving a sequence of $O(n)$ blocking flow problems on layered
networks. We [12, 13, 14] have shown that the minimum-cost circulation problem can be solved by
solving a sequence of $O(n \log(nC))$ blocking flow problems on acyclic but not necessarily layered
networks. In the latter bound, $C$ is the maximum absolute value of an arc cost; all arc costs are
assumed to be integers.

Motivated by Dinic's discovery, several researchers have developed algorithms for finding a blocking
flow in a layered network [2, 5, 8, 9, 16, 18, 21, 22, 23, 24]. Many of these algorithms, e.g. [5, 9, 16,
18, 22, 23, 24], work with the same asymptotic efficiency on arbitrary acyclic networks as on layered
networks. The asymptotically fastest known sequential algorithm is described in [14]; it runs in
$O(m \log(n^2/m))$ time and $O(m)$ space, for an arbitrary acyclic network.

Of the cited algorithms, only one is a parallel algorithm, that of Shiloach and Vishkin [21], which
runs in $O(n \log n)$ time and $O(n^2)$ space (U. Vishkin, private communication, 1986) using $n$
processors. The Shiloach-Vishkin algorithm is stated for layered networks. Although we previously
claimed that their method extends to arbitrary acyclic networks without loss of asymptotic efficiency
[12, 13], this does not seem to be true; their running time analysis breaks down in the general case.
Thus their algorithm cannot be used as an efficient subroutine in solving minimum-cost circulation
problems.

Our goal in this paper is to devise a fast parallel blocking flow algorithm for arbitrary acyclic
networks. In the next section we describe a method based on the concept of atoms; we call this
method the atomic method. In Section 4 we give a parallel implementation of the atomic method.
This implementation runs in $O(n \log n)$ time and $O(nm)$ space using $m$ processors. As a corollary, we
obtain an $O(n^2 (\log n) \log(nC))$-time, $O(nm)$-space, $m$-processor parallel algorithm for the
minimum-cost circulation problem. (See [12, 13, 14].)


3  The  Atomic  Method 

In  this  section  we  describe  a  method  for  finding  blocking  flows  in  acyclic  networks  that  is  based  on  the 
concept of atoms (defined below). Atoms have been used previously in the analysis of maximum flow
algorithms  by  Goldberg  [11]  and  Cheriyan  and  Maheshwari  [1]. 

Our general method is the same as that used by Karzanov [16] and later by others, e.g. [2, 8, 14,
21, 24]. The algorithm begins with a blocking preflow and moves flow excess through the network while
maintaining a blocking preflow, until eventually this flow movement produces a blocking flow. The
algorithm maintains a partition of the vertices into two states: blocked and unblocked. We call an arc
$(v, w)$ admissible if it is residual and $w$ is unblocked. The algorithm blocks a vertex $v$ when it
discovers that none of the arcs leaving $v$ is admissible; once $v$ is blocked, every path from $v$ to $t$
contains a saturated arc. Excess on blocked vertices is returned to where it came from, by decreasing
the flow on appropriate incoming arcs.

To  keep  track  of  the  detailed  flow  movements,  the  algorithm  maintains  a  partition  of  the  flow  excess 
into  atoms.  Consider  a  time  during  an  execution  of  the  algorithm.  An  atom  is  a  maximal  quantity  of 
excess  that  has  moved  in  exactly  the  same  way  so  far.  An  atom  a  at  a  vertex  v  consists  of  an  amount 
of  excess  denoted  by  size(a);  the  vertex  v  is  denoted  by  position(a).  An  atom  located  at  a  vertex  other 
than  s  or  t  is  called  active. 

Associated with an atom $a$ at a vertex $v$ is a path of arcs in $E^{*}$ from $s$ to $v$ that the atom
followed in arriving at $v$. This path is denoted by trace(a). Also associated with $a$ is a simple path
from $s$ to $v$, denoted by path(a), of arcs in $E$ through which the atom moved forward but not backward
in the course of reaching $v$ from $s$. The relationship between trace(a) and path(a) is that path(a)
contains each arc $(v, w)$ such that $(v, w)$ but not $(w, v)$ is on trace(a). The intuition behind the
algorithm is that each atom does a depth-first search from $s$ in an attempt to reach $t$. The graph being
searched changes dynamically as arcs become saturated and vertices become blocked.

procedure Process-Atom(a).
begin
    w ← position(a);
    if w is unblocked then
        if ∃ (w,x) : u_f(w,x) > 0 and x is unblocked then begin
            if size(a) > u_f(w,x) then begin
                [split a]
                create a new atom a';
                position(a') ← w;
                path(a') ← path(a);
                size(a') ← size(a) − u_f(w,x);
                size(a) ← u_f(w,x);
            end;
            position(a) ← x;
            append (w,x) to path(a);
            u_f(w,x) ← u_f(w,x) − size(a);
        end
        else mark w as blocked;
    if w is blocked then begin
        (v,w) ← last arc on path(a);
        position(a) ← v;
        delete (v,w) from path(a);
        u_f(v,w) ← u_f(v,w) + size(a);
    end;
end.

Figure 1: The Process-Atom procedure. Note that the flow is maintained implicitly as the difference
between u and u_f.

During initialization, the algorithm saturates every arc $(s, v)$ leaving the source, creating at each
neighbor $v$ of $s$ an atom of size $u(s, v)$ and trace $(s, v)$. At each iteration, the algorithm selects an
active atom $a$ and processes it as described in Figure 1. Let $w$ = position(a). If $w$ is not blocked, the
algorithm tries to move $a$ forward along an admissible arc, i.e., an arc with positive residual capacity
whose head is unblocked. If no such arc exists, $w$ becomes blocked. If there is such an arc, the algorithm
picks one, say $(w, x)$. If size(a) $> u_f(w, x)$, atom $a$ is split into two parts. One part, of size equal to
size(a) $- u_f(w, x)$, gets a new name $a'$. The other part, of size equal to $u_f(w, x)$, retains the name
$a$. Atom $a'$ remains at vertex $w$ to be processed later; atom $a$ moves to vertex $x$. Finally, if atom $a$
has not moved (i.e., vertex $w$ is blocked), atom $a$ is returned to the vertex, say $v$, from which it
first reached $w$.

Note that an atom can move in two ways: forward from $w$ to $x$ or backward from $w$ to $v$. In the
former case, $w$ is unblocked and $f(w, x)$ increases. In the latter case, $w$ is blocked and $f(v, w)$
decreases. An atom can move backward from $w$ to $v$ only if at a previous time it moved forward from
$v$ to $w$. Thus the flow through an arc never becomes negative. During the course of the algorithm, for
any arc $(w, x)$, the flow on $(w, x)$ first increases, until $x$ becomes blocked, after which the flow
decreases.

Note that we have not specified the way in which we select an atom to be processed next. In the
parallel implementation of the algorithm, all active atoms, and atoms arising from them by iterated
splitting, are processed concurrently. In the sequential implementation, any constant-time selection
rule leads to an $O(nm)$ time bound. For example, one can maintain the set of active atoms as a queue
or a stack. Alternatively, at each vertex one can maintain a list of the atoms located at the vertex,
and keep a queue or a stack of vertices with nonempty lists of atoms.
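As a concrete illustration of the sequential variant with a queue of active atoms, here is a minimal Python sketch. It is not the authors' implementation: the function name `atomic_blocking_flow` and the list-based atom representation are assumptions made here, paths are copied as ordinary lists when an atom splits, and admissible arcs are found by scanning, so the sketch does not attain the constant-time-per-move cost that the analysis assumes.

```python
from collections import deque

def atomic_blocking_flow(arcs, cap, s, t):
    """Return a blocking flow (dict: arc -> value) on an acyclic network."""
    out = {}
    for (v, w) in arcs:
        out.setdefault(v, []).append(w)
    resid = dict(cap)                 # u_f; the flow is cap - resid
    blocked = {s}
    active = deque()                  # each atom is [size, position, path]
    for x in out.get(s, []):          # initialization: saturate arcs leaving s
        size, resid[(s, x)] = resid[(s, x)], 0
        if size > 0 and x != t:
            active.append([size, x, [(s, x)]])
    while active:
        size, w, path = active.popleft()
        if w not in blocked:
            x = next((y for y in out.get(w, [])
                      if resid[(w, y)] > 0 and y not in blocked), None)
            if x is None:
                blocked.add(w)        # no admissible arc leaves w
            else:
                r = resid[(w, x)]
                if size > r:          # split: the remainder stays at w
                    active.append([size - r, w, path[:]])
                    size = r
                resid[(w, x)] -= size
                path.append((w, x))
                if x != t:            # atoms at t are no longer active
                    active.append([size, x, path])
                continue
        # w is blocked: return the atom along the arc on which it arrived
        v, _ = path.pop()
        resid[(v, w)] += size
        if v != s:                    # atoms returned to s are no longer active
            active.append([size, v, path])
    return {a: cap[a] - resid[a] for a in arcs}

# Small acyclic example (vertex names and capacities are arbitrary).
arcs = [("s", "a"), ("s", "b"), ("a", "b"), ("a", "t"), ("b", "t")]
cap = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 2, ("b", "t"): 2}
flow = atomic_blocking_flow(arcs, cap, "s", "t")
```

On this example one unit of the atom created at `a` is pushed to `b`, finds `b` blocked, and is eventually returned to `s`, leaving both arcs into `t` saturated.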

We begin our analysis of the algorithm by bounding the number of atoms.

Lemma  3.1  The  total  number  of  atoms  created  during  an  execution  of  the  atomic  algorithm  is  at  most  m. 

Proof: We claim that each increase in the number of atoms corresponds to an arc saturation. Atoms
created during initialization are charged to the saturation of the corresponding arcs. An atom created
by splitting in procedure Process-Atom is charged to the saturation of the arc $(w, x)$ in the same
execution of the procedure. Thus the claim is true. Since each arc becomes saturated only once, the
lemma is true. ∎

The  next  lemma  gives  the  key  property  of  the  algorithm.  Intuitively,  the  lemma  holds  because  the 
trace  of  an  atom  is  a  partial  traversal  of  a  tree  rooted  at  s. 

Lemma  3.2  Consider  an  atom  a  at  some  time  during  execution  of  the  algorithm.  Then  the  length  of  the 
trace  of  a  is  at  most  2n  -  3. 

Proof: An atom $a$ only moves backward from a vertex $w \notin \{s, t\}$ once $w$ is blocked. Just after $a$
moves backward from $w$, $w$ is not on path(a), and $a$ never visits $w$ again. It follows that, for each
vertex $w \ne s$, $E \cap$ trace(a) contains at most one arc of the form $(v, w)$; and, for each vertex
$w \notin \{s, t\}$, $E^{-1} \cap$ trace(a) contains at most one arc of the form $(w, v)$. The trace therefore
has at most $n - 1$ forward arcs and at most $n - 2$ backward arcs, which gives a bound of $2n - 3$ on the
length of trace(a). ∎

We define phases of the algorithm as follows. Initialization is phase 1. Phase $i$ for $i > 1$ begins
at the end of phase $i - 1$ and ends as soon as every atom that existed at the end of phase $i - 1$,
and every atom created by splitting since the end of phase $i - 1$, has moved at least one step. Since
every atom moves (either forward or backward) at least once during each phase, we have the following
corollary, which is crucial for the analysis of parallel versions of the atomic method.

Corollary  3.3  The  number  of  phases  during  an  execution  of  the  algorithm  is  at  most  2n  -  3. 

To obtain an efficient implementation of the algorithm, we maintain the path of each atom as a
stack of arcs.* When an atom moves forward along an arc, the arc is pushed on top of the stack. To
move an atom backward, we move it to the tail vertex of the top-of-stack arc and pop the stack.

Using stacks allows the algorithm to move atoms forward and backward in constant time. Splitting
an atom, however, requires copying a stack. For ordinary stacks, this requires linear time. A very simple
implementation of persistent stacks [19] (see also [6]) allows the copy operation, as well as the push and
pop operations, to be done in constant time. In combination with Lemmas 3.1 and 3.2, this fact gives
the following result.

* If the network has no multiple arcs, it is sufficient to maintain stacks of vertices on the paths.

Theorem 3.4 The atomic algorithm, implemented using persistent stacks, runs in $O(nm)$ time.
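One standard way to get constant-time push, pop, and copy is an immutable singly linked list. The Python sketch below uses that textbook construction; it is not necessarily the implementation described in [19], and the names `EMPTY`, `push`, `pop`, and `to_list` are chosen here for illustration.

```python
EMPTY = None  # the empty stack

def push(stack, arc):
    """Return a new stack with arc on top; the old stack is left unchanged."""
    return (arc, stack)

def pop(stack):
    """Return (top arc, remaining stack); the old stack is left unchanged."""
    arc, rest = stack
    return arc, rest

def to_list(stack):
    """List the arcs from bottom to top (linear time, for inspection only)."""
    items = []
    while stack is not EMPTY:
        arc, stack = stack
        items.append(arc)
    return items[::-1]

# Splitting an atom: the new atom copies the path by sharing the reference.
path_a = push(push(EMPTY, ("s", "v")), ("v", "w"))
path_a_prime = path_a                         # constant-time "copy"
path_a = push(path_a, ("w", "x"))             # a moves forward along (w, x)
path_a_prime = push(path_a_prime, ("w", "y")) # a' later moves along (w, y)
```

Because no cell is ever modified after creation, the two atoms can extend or shrink their paths independently while sharing the common prefix.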


4  A  Parallel  Implementation 

In  this  section  we  describe  a  parallel  implementation  of  the  atomic  method.  The  parallel  implementation 
works  in  pulses;  at  each  pulse,  every  atom,  including  those  arising  by  splitting,  moves  either  forward  or 
backward  or  both.  Thus  each  pulse  completes  at  least  one  phase,  where  a  phase  is  as  defined  in  Section 
3. 

The parallel implementation consists of the following five steps:

Step 1 (initialize). For each arc $(s, v)$, set $f(s, v) = u(s, v)$. Create an atom at $v$ of size $f(s, v)$ with a
stack containing only $(s, v)$. For each arc $(v, w)$ with $v \ne s$, set $f(v, w) = 0$. Block vertex $s$ and
unblock all other vertices.

Step 2 (push flow forward). For each unblocked vertex $w \notin \{s, t\}$, in parallel, do the following.

Arbitrarily order the atoms at $w$, say $a_1, a_2, \ldots, a_k$, and the admissible arcs $(w, x)$, say
$(w, x_1), (w, x_2), \ldots, (w, x_l)$. For $1 \le i \le k$, compute a cumulative size
$S(i) = \sum_{h=1}^{i} \mathrm{size}(a_h)$. For each $1 \le j \le l$, compute a cumulative residual capacity
$R(j) = \sum_{h=1}^{j} u_f(w, x_h)$. Assign the atoms $a_i$ to the admissible arcs $(w, x_j)$ as follows:

1. If $S(i) - \mathrm{size}(a_i) \ge R(j) - u_f(w, x_j)$ and $S(i) \le R(j)$, assign all of atom $a_i$ to $(w, x_j)$.

2. If $S(i) - \mathrm{size}(a_i) \ge R(j) - u_f(w, x_j)$ and $S(i) > R(j)$, assign an amount
$R(j) - S(i) + \mathrm{size}(a_i)$ of atom $a_i$ to $(w, x_j)$.

3. If $S(i) - \mathrm{size}(a_i) < R(j) - u_f(w, x_j)$ and $S(i) > R(j)$, assign an amount $u_f(w, x_j)$ of
atom $a_i$ to $(w, x_j)$.

4. If $S(i) - \mathrm{size}(a_i) < R(j) - u_f(w, x_j)$ and $S(i) \le R(j)$, assign an amount
$S(i) - R(j) + u_f(w, x_j)$ of atom $a_i$ to $(w, x_j)$.

(This assignment associates with each admissible arc a total amount of excess less than or equal to its
residual capacity. At most one such arc receives an amount that is positive but less than its residual
capacity. The total amount assigned to admissible arcs equals the minimum of the excess at $w$ and the
sum of the residual capacities of the admissible arcs $(w, x_j)$.)

Split any atom assigned to more than one arc into two or more atoms, one per assigned arc, each of size
equal to the amount of the original atom assigned to the arc. Each of the new atoms inherits the
assignment of the corresponding amount of the old atom, as well as a copy of the stack of the old atom.

For each admissible arc $(w, x_j)$, increase $f(w, x_j)$ by the sum of the sizes of the atoms assigned to
$(w, x_j)$, and move each such atom to $x_j$, pushing $(w, x_j)$ onto its stack. If all arcs $(w, x_j)$ are now
saturated, mark $w$ to be blocked. (Do not block $w$ yet.)
Step 3 (block vertices). Block every vertex marked to be blocked in Step 2.

Step 4 (return flow). For each blocked vertex $w \notin \{s, t\}$, in parallel, do the following:

For each atom $a$ at $w$, let $(v_a, w)$ be the top arc on stack(a). Pop stack(a). Decrease $f(v_a, w)$ by
size(a) and move $a$ to $v_a$.

Step  5  (loop).  If  every  atom  is  at  s  or  t,  stop.  Otherwise,  go  to  Step  2. 
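The assignment rule of Step 2 amounts to laying the atoms end to end on one axis and the admissible arcs end to end on another, and giving each arc the part of each atom that overlaps it. The following Python sketch computes these overlaps sequentially for a single vertex; the function name `assign_atoms`, the 0-based indices, and the pairwise double loop are assumptions made for illustration, whereas the parallel implementation would rely on prefix computations rather than scanning all pairs.

```python
from itertools import accumulate

def assign_atoms(sizes, residuals):
    """Return triples (atom index, arc index, amount) with amount > 0.

    sizes[i] is size(a_i); residuals[j] is u_f(w, x_j). Indices are 0-based.
    """
    S = list(accumulate(sizes))       # S[i]: cumulative size through atom i
    R = list(accumulate(residuals))   # R[j]: cumulative residual through arc j
    assignment = []
    for i, size in enumerate(sizes):
        for j, r in enumerate(residuals):
            # Atom i covers [S[i] - size, S[i]]; arc j covers [R[j] - r, R[j]].
            low = max(S[i] - size, R[j] - r)
            high = min(S[i], R[j])
            if high > low:            # the four cases of Step 2 reduce to this overlap
                assignment.append((i, j, high - low))
    return assignment

# Three atoms of sizes 3, 2, 4 and two admissible arcs with residuals 4 and 2.
result = assign_atoms([3, 2, 4], [4, 2])
```

Here the second atom is split between the two arcs, both arcs become saturated, and three units of the third atom stay at the vertex, so the total assigned is 6, the minimum of the excess (9) and the total residual capacity (6).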

By using standard techniques of parallel computation [15], including fast sorting [3], parallel prefix
computations [17], and computations based on complete binary trees [25], one can implement each step
of the algorithm to run in $O(\log n)$ time on an $m$-processor PRAM. (Lemma 3.1 implies that only $m$
processors are necessary.) The details are routine and we omit them; the reader can refer to [12, 21] for
more details on how to use these techniques to implement flow algorithms.

The running time of the entire algorithm is then $O(n \log n)$ by Corollary 3.3. The space required is
dominated by the space for the paths of atoms, which is $O(nm)$.
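Spelled out, the two bounds follow from results already stated: Corollary 3.3 bounds the number of pulses, Lemma 3.1 bounds the number of atoms, and path(a) is a simple path. The symbols $T$ and $S$ below are ad hoc names for the total time and space, introduced only for this restatement.

```latex
\begin{align*}
T &\le \underbrace{(2n - 3)}_{\text{pulses (Corollary 3.3)}} \cdot
       \underbrace{O(\log n)}_{\text{time per pulse}} = O(n \log n), \\
S &\le \underbrace{m}_{\text{atoms (Lemma 3.1)}} \cdot
       \underbrace{(n - 1)}_{\text{arcs on a simple path}} + O(m) = O(nm).
\end{align*}
```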


5  Distributed  Implementation 

The atomic method has a natural implementation in a distributed model of computation, due to the
robustness of the order in which active atoms are processed. By Lemma 3.2, a straightforward
implementation of the atomic algorithm on either a synchronous or an asynchronous distributed model
of computation [10] works in $O(n)$ message-passing rounds using $O(nm)$ messages. We can thread the
persistent stacks representing the paths through the vertices to obtain an $O(m)$ space bound per vertex.

Recall that we would like to use the blocking flow algorithm as a subroutine in our minimum-cost
circulation method [12, 13, 14]. In order to do this, we need to add termination detection to our
distributed algorithm (so that the processors know when to start the next stage of the minimum-cost
circulation algorithm). The termination detection can be obtained without increasing the asymptotic
time bounds by using the technique of Dijkstra and Scholten [4] for detecting termination of diffusing
computations (a simpler termination detection technique specific to minimum-cost circulation
algorithms is discussed in [12]). The Dijkstra-Scholten technique works for algorithms with a single
initiator. This is not a problem for the blocking flow algorithm described in this paper, since the
algorithm is initiated by the source processor. The version of the problem that comes up in the
execution of the minimum-cost circulation algorithm, however, has several capacitated sources instead
of a single uncapacitated source. Therefore, we need to construct a spanning tree in the network and
select a leader before running the minimum-cost circulation algorithm. Even in the asynchronous model,
this preprocessing can be done in $O(n \log n)$ time, which is dominated by the $O(n^2 \log(nC))$ running
time of the minimum-cost circulation algorithm.

Note  that  the  above  bounds  for  distributed  computation  are  not  very  good  from  the  theoretical 
viewpoint.  We  do  just  as  well  by  sending  all  the  information  about  the  network  to  a  single  vertex 
and letting it do all the computation. In practice, however, our distributed algorithm should be more
efficient than such a centralized computation.


6  Concluding  Remarks 

In  conclusion,  we  would  like  to  discuss  some  open  questions  related  to  the  problems  studied  in  this 
paper. 

The parallel complexity of the blocking flow problem (in layered, acyclic, and general networks) is
wide open. This problem is not known to be in NC; nor is it known to be P-complete. Resolving either
of these questions seems to be hard. A possibly simpler question is whether an $O(n^{\epsilon})$-time
blocking flow algorithm for $0 < \epsilon < 1$ exists.

Orlin's minimum-cost circulation algorithm [20], implemented using the best parallel shortest path
algorithm currently known, solves the minimum-cost circulation problem in $O(m \log^3 n)$ time using
$n^3/\log n$ processors. Although for most possible values of $n$, $m$, and $C$, this time bound is better
than the time bound achieved by our minimum-cost circulation algorithm discussed in Section 2, our
algorithm is more practical since it uses only $m$ processors.

There are some obvious inefficiencies in our algorithm. Though the running time is faster than
that of our sequential algorithm [14] by a factor of $m \log(n^2/m)/(n \log n)$, the total work done by the
algorithm (the product of the running time and the number of processors) is $O(nm \log n)$, a factor of
$n \log n/\log(n^2/m)$ worse than that of our sequential algorithm. The sequential algorithm uses much
more complicated data structures, however. If only simple data structures are used, the running time
bound of our sequential algorithm increases by a factor of $n^2/(m \log(n^2/m))$. Even then, the total
work done by the parallel algorithm is greater by a factor of $(m \log n)/n$. The atomic method can be
improved by combining atoms that are at the same vertex at the same time and moving them forward
together, thereby reducing the number of forward flow pushes. Also, if some of the excess at a vertex $v$
is to be returned from $v$, it does not matter which part of the excess is selected for returning, since
there is only one kind of commodity involved. It is easy to design an improved algorithm based on these
ideas, but we have been unable to obtain any improvement in our asymptotic resource bounds by doing so.
The Shiloach-Vishkin result [21] suggests the possible existence of a blocking flow algorithm for acyclic
networks running in $O(n \log n)$ time and $O(n^2)$ space using $n$ processors. Finding such an algorithm,
or disproving its existence, is a challenging open problem.